We propose a Parton Distribution Function (PDF) fitting technique which is based on an
interactive neural network algorithm using Self-Organizing Maps (SOMs). SOMs are visualization
algorithms based on competitive learning among spatially-ordered neurons. Our SOMs are trained
with stochastically generated PDF samples. On every optimization iteration the PDFs are
clustered on the SOM according to a user-defined feature and the most promising candidates are
selected as a seed for the subsequent iteration. Our main goal is thus to provide a fitting
procedure that, at variance with the global analyses and standard neural network approaches,
allows for an increased control of the systematic bias by enabling user interaction in the
various stages of the fitting process.

1. Introduction

The cross sections for a number of hadronic reactions can be computed using perturbative
Quantum Chromodynamics (pQCD) by convoluting the perturbatively calculable hard scattering
coefficients with the nonperturbative Parton Distribution Functions (PDFs), which parametrize the
large-distance hadronic structure. The accuracy with which the theoretical predictions for
observables of such reactions can be compared against the high-precision experimental data thus
depends not only on the accuracy of the hard scattering calculations, but also on the accuracy
with which the PDFs are known.

Currently, the established method to obtain the PDFs, used by the major PDF collaborations
(CTEQ, MRST, Alekhin, ZEUS and H1), is the global
analysis supplemented with an error estimation using some variant of the Hessian method.
This powerful combination allows for both extrapolation outside the
kinematical range of the data and extension to multivariable cases, such as nuclear PDFs. However,
there are uncertainties related to the method itself that are difficult to quantify but may turn
out to have a large effect. The differences between the current global PDF sets indeed tend to be
larger than the estimated uncertainties, and these differences in turn translate to the
predictions for the LHC observables, such as the Higgs or W and Z production cross sections.
For details of PDF
uncertainty studies see e.g. Refs. [§]. 

Another approach to PDF fitting has recently been proposed by the NNPDF collaboration [10],
who have replaced the typical functional form ansätze used in global analyses with more complex
standard neural network (NN) solutions, and the Hessian method with Monte Carlo (MC) sampling
of the data. The NNPDF method circumvents many of the problems global analyses suffer from, such
as the bias resulting from fixing a functional form and from selecting a suitable tolerance
$\Delta\chi^2$ needed in the Hessian method. It relies on a genetic algorithm (GA) which works
on a population of solutions for each MC replica of the data, thus having a lowered possibility
of getting stuck in local minima. The estimated uncertainties for NNPDF fits are larger than
those of global fits, possibly indicating that the global fit uncertainties may have been
underestimated. The complexity of NN results, however, may also induce problems, especially
when used in a purely automated fitting procedure. Since the effect of modifying individual NN
parameters is unknown, the result may exhibit strange or unwanted behaviour in the
extrapolation region, or in between the data points if the data are sparse. Implementation of
information not given directly by the data, such as nonperturbative models, lattice
calculations or knowledge from prior work in general, is also difficult in this approach.



The new PDF fitting method we have recently proposed in Ref. [11] relies on the use of
Self-Organizing Maps (SOMs), a subtype of neural network. The idea of our method is to create
a means for introducing "Researcher Insight" instead of "Theoretical bias" by giving up a fully
automated fitting procedure, and eventually to develop an interactive fitting program which
would allow us to combine the best features of both the global analysis and the NNPDF approach.

2. Self-Organizing Maps 

The SOM [12] is a visualization algorithm which attempts to represent all the available
observations with optimal accuracy using a restricted set of models. A SOM consists of nodes
(map cells), which are all assigned spatial coordinates, and the topology of the map is
determined by a chosen distance metric $M_{\mathrm{map}}$. Each cell $i$ contains a map vector
$V_i$ that is isomorphic to the data samples used for training of the neural network. For a
simple 2-dimensional rectangular lattice, our choice for the SOM shape, a natural choice for
the topology is $L_1(x,y) = \sum_{i=1}^{2} |x_i - y_i|$.

The implementation of SOMs proceeds in three stages: 1) initialization of the SOM (see
Fig. 1), 2) training of the SOM (Fig. 1) and 3) associating the data samples with a trained map,
i.e. clustering. For the details of the SOM implementation, see [11].




Figure 1: Left: SOM initialization. Right: SOM training. 

At the end of the training stage, cells that are topologically close to each other have map
vectors which are most similar to each other (according to a chosen similarity metric
$M_{\mathrm{data}}$) compared to all the other map vectors. In the matching phase the actual data
are matched against the map vectors of the trained map, and thus get distributed on the map
according to the feature that was used as the similarity criterion. Clusters now emerge as a
result of unsupervised learning. This local similarity property is the feature that makes the SOM
suitable for visualization purposes, thus facilitating user interaction with the data. Since
each map vector now represents a class of similar objects, the SOM is an ideal tool to visualize
high-dimensional data, by projecting it onto a low-dimensional map clustered according to some
desired feature.

In our work we used the so-called batch version of the training, in which all the training data
samples are matched against the map vectors before the training begins. The map vectors are
then averaged with all the training samples within the neighbourhood radius simultaneously. The
procedure is repeated $N_{\mathrm{step}}$ times (a free parameter), such that in every training
step the same set of training data samples is associated with the evolving map. The benefit of
batch training compared to incremental training, shown in Fig. 1, is that the training is
independent of the order in which the training samples are introduced on the map.
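The batch update described above can be illustrated with a minimal NumPy sketch. It is not the implementation used for the results in this paper: the function names, the constant neighbourhood radius and the choice to average each map vector together with the samples matched within its lattice neighbourhood are assumptions made for illustration, with the $L_1$ metric used both on the lattice and between vectors.

```python
import numpy as np

def l1(a, b):
    """L1 (Manhattan) distance along the last axis."""
    return np.abs(a - b).sum(axis=-1)

def match(samples, vectors):
    """Index of the most similar map vector for every sample."""
    return l1(samples[:, None, :], vectors[None, :, :]).argmin(axis=1)

def batch_train(map_vectors, coords, samples, n_step, radius=1):
    """Hypothetical batch-SOM training loop.

    map_vectors: (n_cells, dim) initial map vectors
    coords:      (n_cells, 2) lattice coordinates of the cells
    samples:     (n_samples, dim) training set, identical in every step
    radius:      lattice neighbourhood radius (kept constant here, an assumption)
    """
    vectors = np.array(map_vectors, dtype=float)
    for _ in range(n_step):
        # All samples are matched before any map vector is changed,
        # so the result does not depend on the order of the samples.
        winners = match(samples, vectors)
        # Lattice distance from every cell to the winning cell of each sample.
        lattice = l1(coords[:, None, :], coords[winners][None, :, :])
        updated = vectors.copy()
        for cell in range(len(vectors)):
            members = samples[lattice[cell] <= radius]
            if len(members):
                # Average the map vector with the samples in its neighbourhood.
                updated[cell] = (vectors[cell] + members.sum(axis=0)) / (len(members) + 1)
        vectors = updated
    return vectors, match(samples, vectors)
```

Calling `batch_train` on a shuffled copy of the same training set should return the same map vectors up to floating-point rounding, which is the order independence referred to above.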

3. ENVPDF algorithm 

The aim of our approach is i) to be able to study the properties of the PDFs in a
model-independent way, yet ii) to be able to implement knowledge from prior work on PDFs,
and ultimately iii) to be able to guide the fitting procedure interactively with the help of the
SOM properties.

To accomplish this, we choose, at variance with the "conventional" PDF sets or NNPDFs,
to give up the functional form for the PDFs and rather to rely on purely stochastic methods in
generating the initial and training PDF samples. Our choice is a GA-type analysis, in which our
parameters are the values of the PDFs at the initial scale for each flavour at each value of $x$
where experimental data exist. To obtain control over the shape of the PDFs we use some of the
existing distributions to establish an initial range, or envelope, within which we sample the
candidate PDF values.



For now we concentrate on DIS structure function data from H1 [13], BCDMS [14] and ZEUS
[15], which we use without additional kinematical cuts or normalization factors. The parameters
for the DGLAP scale evolution were chosen to be those of CTEQ6 (CTEQ6L1 for lowest order
(LO)) [16], the initial scale being $Q_0 = 1.3$ GeV. In the next-to-leading order (NLO) case the
evolution code was taken from [17] (QCDNUM17 beta release).

We use the CTEQ6 [16], CTEQ5, MRST02, Alekhin and GRV98 PDF sets as our init PDFs. We
construct our initial PDF generators to first select randomly, for each flavour separately,
either the range [0.5, 1], [1.0, 1.5] or [0.75, 1.25] times any one of the init PDF sets. Next
the initial generators generate values for each $x_{\mathrm{data}}$ using a uniform, instead of
Gaussian, distribution around the init PDFs, thus reducing direct bias from them. (To ensure a
reasonable large-$x$ behaviour for the PDFs, we also generate with the same method values for
them in a few $x$-points outside the range of the experimental data. For simplicity we also
require the gluons to be positive in NLO.) Gaussian smoothing is applied to the resulting set
of points, and the flavours are combined to form a PDF set such that the curves are linearly
interpolated from the discrete set of generated points, and scaled to conserve momentum, baryon
number and charge. In this study we accept the normalization error of less than a few percent
which results from the fact that our $x$-range is not $x = [0, 1]$ but
$x = [\min(x_{\mathrm{data}}), 1]$. We call this type of PDF set database PDFs.
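As an illustration of the generator described above, the following sketch samples one candidate set inside a randomly chosen envelope, smooths it and rescales it to a momentum sum. It is a toy under stated assumptions: the init sets are stand-in arrays of non-negative $x f(x)$ values on the data grid, the smoothing width is arbitrary, only the momentum sum rule is imposed (baryon number and charge are omitted), and all function names are hypothetical. The trapezoidal rule is used because it integrates the linear interpolant between the generated points exactly.

```python
import numpy as np

# Envelope factors quoted in the text.
RANGES = [(0.5, 1.0), (1.0, 1.5), (0.75, 1.25)]

def trapezoid(y, x):
    """Trapezoidal integral of y(x) on a possibly non-uniform grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def gaussian_smooth(y, width=1.0):
    """Smooth y with a normalized Gaussian kernel; width is in grid points (assumed)."""
    half = int(3 * width) + 1
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / width) ** 2)
    padded = np.pad(y, half, mode="edge")
    return np.convolve(padded, kernel / kernel.sum(), mode="valid")

def generate_database_pdf(x_data, init_sets, rng, total_momentum=1.0):
    """Generate one toy database PDF set.

    init_sets: list of dicts {flavour: array of x*f(x) >= 0 on x_data};
               these are stand-ins for the init PDF sets, not real parametrizations.
    """
    pdf = {}
    for flavour in init_sets[0]:
        # Per flavour: pick an init set and an envelope range at random.
        base = init_sets[rng.integers(len(init_sets))][flavour]
        lo, hi = RANGES[rng.integers(len(RANGES))]
        raw = rng.uniform(lo * base, hi * base)   # uniform, not Gaussian, sampling
        pdf[flavour] = gaussian_smooth(raw)
    # Rescale so that the momentum carried on [min(x_data), 1] sums to total_momentum.
    momentum = sum(trapezoid(values, x_data) for values in pdf.values())
    return {f: values * total_momentum / momentum for f, values in pdf.items()}
```

Because the integration range starts at $\min(x_{\mathrm{data}})$ rather than 0, the rescaled set carries the normalization error mentioned above.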

For an $N \times N$ SOM we choose the size of the database to be $4N^2$. We randomly initialize
the map with database PDF sets, such that each map vector $V_i$ consists of the PDF set itself
and of the observables $F_2(x,Q^2)$ derived from it, and train the map with $N_{\mathrm{step}}$
batch-training steps. In order to obtain a reasonable selection of PDFs to start with, we
reject candidates which have $\chi^2/N > 10$. We choose the similarity criterion to be the
similarity of the observables $F_2(x,Q^2)$ with $M_{\mathrm{data}} = L_1$. The similarity is
tested at every $x_{\mathrm{data}}$ value, both at the initial scale and at all the evolved
scales where experimental data exist. On every training step, after the matching, all the
observables (PDFs) of the map vectors get averaged with the observables (PDFs, flavour by
flavour) matched within the neighbourhood. The resulting new averaged map vector PDFs are
rescaled again to obey the sum rules. We call this type of PDF set map PDFs. The map PDFs are
evolved, and the observables at every experimental data scale are computed and compared for
similarity with the observables from the training PDFs. After the training we have a map with
$N^2$ map PDFs and the same $4N^2$ database PDF sets we used to train the map. This is the end
of the first optimization iteration.
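The rejection cut and the ranking of candidates can be sketched as follows. This is a simplified illustration, not the analysis code: it assumes uncorrelated experimental errors, takes the predicted observables as given arrays (the DGLAP evolution and the computation of $F_2$ are not shown), and the function names are hypothetical.

```python
import numpy as np

def chi2_per_point(theory, data, sigma):
    """chi^2 / N for N data points, assuming uncorrelated errors."""
    return float(np.sum(((theory - data) / sigma) ** 2) / len(data))

def select_candidates(predictions, data, sigma, cut=10.0):
    """Return the indices of candidates with chi^2/N <= cut, best first,
    together with the chi^2/N of every candidate."""
    scores = np.array([chi2_per_point(p, data, sigma) for p in predictions])
    keep = np.flatnonzero(scores <= cut)
    return keep[np.argsort(scores[keep])], scores
```

For an $N \times N$ map, the first $2N$ entries of the returned index array would then correspond to the best PDFs carried over as init PDFs in the later iterations.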

During the later iterations we proceed as follows. At the end of each iteration we pick from
the trained SOM the $2N$ best PDFs as the init PDFs. These init PDFs are introduced into the
training set alongside the database PDFs, which are now constructed using each of the init PDFs
in turn as the center of a Gaussian random number generator, which assigns to all the flavours
at each $x$ a value around that same init PDF such that the $1\sigma$ of the generator is given
by the spread of the best PDFs in the topologically nearest neighbouring cells. The object of
these generators is thus to refine a good candidate PDF found in the previous iteration by
jittering its values within a range determined by the shape of other good candidate PDFs from
the previous iteration. The generated PDFs are then smoothed and scaled to obey the sum rules.
Sets with $\chi^2/N > 10$ are always rejected. It is important to preserve the variety of the
PDF shapes on the map, so we also keep $N_{\mathrm{orig}}$ copies of the first iteration
generators in our generator mix. Since the best PDF candidates from the previous iteration are
matched on this new map as unmodified init PDFs, it is guaranteed that the $\chi^2/N$ as a
function of the iteration either decreases or remains the same. We keep repeating the
iterations until the $\chi^2/N$ saturates.
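A possible reading of the later-iteration generators is sketched below. Taking the pointwise standard deviation of the neighbouring best PDFs as the $1\sigma$ width is an assumption of this sketch, as are the function names; smoothing and sum-rule rescaling would follow as in the first iteration. The second function only illustrates why the best $\chi^2/N$ cannot increase when the previous best candidates are kept unmodified in the training set.

```python
import numpy as np

def jitter_around(init_pdf, neighbour_pdfs, rng):
    """Draw one candidate around init_pdf for a single flavour.

    init_pdf:       (n_x,) values at the data x-points
    neighbour_pdfs: (n_neigh, n_x) best PDFs from the nearest neighbouring cells
    The pointwise standard deviation of the neighbours is used as 1 sigma
    (an assumed definition of the "spread").
    """
    sigma = np.std(neighbour_pdfs, axis=0)
    return rng.normal(loc=init_pdf, scale=sigma)

def best_chi2_history(best_of_new_candidates):
    """Best chi^2/N after each iteration when the previous best is always retained:
    the running minimum, which can only decrease or stay the same."""
    return np.minimum.accumulate(np.asarray(best_of_new_candidates, dtype=float))
```

Where the neighbouring PDFs agree exactly, the width is zero and the candidate reproduces the init PDF at that point; where they disagree, the jitter is correspondingly wider.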

The best $\chi^2/N$ values of the original init PDFs¹ are 1.67 for LO (CTEQ6) and 1.89 for NLO
(MRST02), and Table 1 lists results from a variety of ENVPDF runs. The results do not seem
to be very sensitive to the number of SOM training steps, $N_{\mathrm{step}}$, but are highly
sensitive to the number of first iteration generators used in subsequent iterations. Although
the generators can now in principle produce an infinite number of different PDFs, the algorithm
would not be able to radically change the shape of the database PDFs without introducing a
random element on the map. Setting $N_{\mathrm{orig}} > 0$ provides, through map PDFs, that
element, and keeps the algorithm from getting stuck in a local minimum.



SOM      $N_{\mathrm{step}}$   $N_{\mathrm{orig}}$   LO $\chi^2/N$   NLO $\chi^2/N$
5x5      5                     2                     1.04            1.08
5x5      10                    2                     1.10
5x5      20                    2                     1.10
5x5      30                    2                     1.10
5x5      40                    2                     1.08
5x5      5                                           1.41
15x15    5                     6                     1.00            1.07

Table 1: $\chi^2/N$ for a variety of ENVPDF runs against all the datasets (H1, ZEUS, BCDMS, $N = 709$).

Due to the stochastic nature of the ENVPDF algorithm, we may well study the combined
results from several separate runs. It is especially important to verify the stability of our
results, to show that the results are indeed reproducible instead of lucky coincidences. The
left panel of Fig. 2 presents the best NLO results, and the combined $\chi^2/N < 1.2$ spreads
of the PDFs from any iteration, for 10 repeated 5x5, $N_{\mathrm{step}} = 5$ runs at the
initial scale. The average $\chi^2/N$ and the standard deviation $\sigma$ for these runs are
1.122 and 0.029, corresponding to $\Delta\chi^2 \approx 20$. The right panel of the same
Fig. 2 shows the 10 best result curves and the $\chi^2/N < 1.2$ spreads evolved up to
$Q = 3.0$ GeV. Since we have only used DIS data in this study, we are only able to explore the
small-$x$ uncertainty for now, and expectedly, the small-$x$ gluons obtain the largest
uncertainty for all the cases we studied.
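The quoted tolerance can be checked with a short piece of arithmetic, assuming that it is obtained by scaling the run-to-run standard deviation of $\chi^2/N$ by the number of data points, $N = 709$, given in the caption of Table 1:

```latex
\begin{align*}
\Delta\chi^2 &\approx \sigma \times N = 0.029 \times 709 \approx 20.6, \\
\langle \chi^2 \rangle &\approx 1.122 \times 709 \approx 795.5 .
\end{align*}
```

Under this assumption a run-to-run spread of about 20 units sits on top of a total $\chi^2$ of roughly 800.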

Clearly the seemingly large difference between the small-$x$ gluon results at the initial scale
is not statistically significant, but gets smoothed out during the course of the QCD evolution.
The evolved curves also preserve the initially set baryon number scaling within 0.5% and the
momentum sum rule within 1.5% accuracy. Thus the initial scale wiggliness of the PDFs is mainly
only a residual effect from our method of generating them and not linked to the overtraining of
the SOM.

¹ These are the $\chi^2/N$ for the initial scale PDF sets taken from the quoted parametrizations
and evolved with CTEQ6 DGLAP settings; no kinematical cuts or normalization factors for the
experimental data were imposed. We do not claim these values to describe the quality of the
quoted PDF sets.

Therefore our simple method of producing the candidate PDFs by jittering random numbers
inside a predetermined envelope is surprisingly stable when used together with the complicated
PDF processing that SOMs provide. Remarkably then, even a single SOM run can provide a quick
uncertainty estimate for a chosen $\Delta\chi^2$ without performing a separate error analysis.




Figure 2: NLO ENVPDF best results and the $\chi^2/N < 1.2$ spreads of results from 10 separate runs.



4. Future of the SOMPDFs 

So far we have shown a relatively straightforward method of obtaining stochastically
generated, parameter-free PDFs, with an uncertainty estimate for a desired $\Delta\chi^2$.
However, the proposed method can be extended much further than that. What ultimately sets the
SOM method apart from the standard global analyses or the NNPDF method are the clustering and
visualization possibilities that it offers. Instead of setting $M_{\mathrm{data}} = L_1$ and
clustering according to the similarity of the observables, it is possible to set the clustering
criterion to be anything that can be mathematically quantified, e.g. the shape of the gluons or
the large-$x$ behaviour of the PDFs. The desired feature of the PDFs can then be projected out
from the SOM. Moreover, by combining the method with an interactive graphical user interface
(GUI), it would be possible to change and control the shape and the width of the envelope as the
minimization proceeds, to guide the process by applying researcher insight at various stages of
the process, and the uncertainty band produced by the SOM could further help the user to make
decisions about the next steps of the minimization. With a GUI it would be possible, e.g., to
set the generators to sample a vector consisting of PDF parameters, instead of values of PDFs
at each value of $x$ of the data. That would lead to smooth, continuous solutions, either along
the lines of global analyses, or of NNPDFs using $N$ SOMs for $N$ Monte Carlo sampled replicas
of the data. For such a method, all the existing error estimates, besides an uncertainty band
produced by the map, would be applicable as well. Since the solution would be required to stay
within an envelope of selected width and shape, no restrictions on the parameters themselves
would be required, and it would be possible, e.g., to constrain the extrapolation of the NN
generated PDFs outside the $x$-range of the data without explicitly introducing terms to ensure
the correct small- and large-$x$ behaviour as in the NNPDF method. The selection of the best
PDF candidates for the subsequent iteration could then be made based on the user's preferences
instead of solely on the $\chi^2/N$. That kind of method in turn could be extended to more
complex hadronic matrix elements, such as the ones defining the GPDs, which are natural
candidates for future studies of cases where the experimental data are not numerous enough to
allow for a model-independent fitting, and the guidance and intuition of the user is therefore
irreplaceable. The possibilities of such a method are largely unexplored.